First, let’s get our environment ready by loading the data and necessary packages. Tidyverse includes dplyr and ggplot2, which we will be using here.

gapminder <- read.csv("data/gapminder_data.csv", header = TRUE)
library(gapminder)
## Warning: package 'gapminder' was built under R version 4.4.3
## 
## Attaching package: 'gapminder'
## The following object is masked _by_ '.GlobalEnv':
## 
##     gapminder
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.1
## Warning: package 'ggplot2' was built under R version 4.4.2
## Warning: package 'purrr' was built under R version 4.4.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Manipulating data frames means many things to many researchers: we often select certain observations (rows) or variables (columns), group the data by one or more variables, or calculate summary statistics. We can do all of this using the normal base R operations you learned in Jason’s segment.

Let’s find the average GDP per capita for each continent in the gapminder dataset. First, view the different continents present in the gapminder dataset.

levels(as.factor(gapminder$continent))
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"

Quick memory jog – why do we have to include the as.factor() command inside our levels command?

Next, let’s take the mean for each continent.

mean(gapminder$gdpPercap[gapminder$continent == "Africa"])
## [1] 2193.755
mean(gapminder$gdpPercap[gapminder$continent == "Americas"])
## [1] 7136.11
mean(gapminder$gdpPercap[gapminder$continent == "Asia"])
## [1] 7902.15
mean(gapminder$gdpPercap[gapminder$continent == "Europe"])
## [1] 14469.48
mean(gapminder$gdpPercap[gapminder$continent == "Oceania"])
## [1] 18621.61

But this isn’t very nice because there is a fair bit of repetition. Imagine doing this for each country in the dataset (there are 142)! Repeating yourself will cost you time, both now and later, and potentially introduce some nasty bugs (did you make any typos while typing out those 5 lines of code?).

The dplyr package

Luckily, the dplyr package provides a number of very useful functions for manipulating data frames in a way that will reduce the above repetition, reduce the probability of making errors, and probably even save you some typing. As an added bonus, you might even find the dplyr grammar easier to read.

Tip: Tidyverse

The dplyr package belongs to a broader family of opinionated R packages designed for data science called the “Tidyverse”. These packages are specifically designed to work harmoniously together. Some of them are covered in this course, and you can find more complete information here: https://www.tidyverse.org/.

Here we’re going to cover 5 of the most commonly used functions as well as using pipes (%>%) to combine them.

  1. select()
  2. filter()
  3. group_by()
  4. summarize()
  5. mutate()

If you have not installed this package earlier, please do so:

install.packages('dplyr')

Now let’s load the package:

library("dplyr")

Using select()

If, for example, we wanted to move forward with only a few of the variables in our data frame we could use the select() function. This will keep only the variables you select.

year_country_gdp <- select(gapminder, year, country, gdpPercap)

Diagram illustrating use of select function to select two columns of a data frame

If we want to remove only one column from the gapminder data, for example the continent column, we can put a minus sign in front of its name.

smaller_gapminder_data <- select(gapminder, -continent)

If we open up year_country_gdp we’ll see that it only contains the year, country and gdpPercap, and if we look at the smaller_gapminder_data object in the environment, we can see it has one fewer variable than gapminder, since we removed the continent column.
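As a minimal sketch of both forms of select(), here they are applied to a tiny made-up data frame. The object `toy` and its values are invented for illustration and are not part of the gapminder data.

```r
library(dplyr)

# Hypothetical toy data frame standing in for gapminder
toy <- data.frame(
  country   = c("A", "B"),
  continent = c("X", "Y"),
  year      = c(2002, 2007),
  gdpPercap = c(1000, 2000)
)

# Keep only the named columns, in the order they are listed
kept <- select(toy, year, country, gdpPercap)
names(kept)

# Drop one column and keep everything else
dropped <- select(toy, -continent)
names(dropped)
```

Comparing names(kept) and names(dropped) with names(toy) is a quick way to confirm which columns survived each call.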

Above we used ‘normal’ grammar, where the first argument provides the data frame we are working on and subsequent arguments give the columns to operate on. But the strengths of dplyr lie in combining several functions using “pipes”. Since the pipe grammar is unlike anything we’ve seen in R before, let’s repeat what we’ve done above using pipes.

year_country_gdp <- gapminder %>% select(year, country, gdpPercap)

To help you understand why we wrote that in that way, let’s walk through it step by step. First we summon the gapminder data frame and pass it on, using the pipe symbol %>%, to the next step, which is the select() function. In this case we don’t specify which data object we use in the select() function since it gets that from the previous pipe.
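One way to convince yourself of this is to check that the piped and non-piped calls give the same result. The sketch below uses a small made-up data frame (`toy`, invented for illustration) rather than gapminder.

```r
library(dplyr)

# Hypothetical toy data frame
toy <- data.frame(
  year    = c(1952, 1957),
  country = c("A", "B"),
  pop     = c(10, 20)
)

# The pipe passes its left-hand side as the first argument of the call on its right
piped  <- toy %>% select(year, country)
nested <- select(toy, year, country)

# Compare the two results
identical(piped, nested)
```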

Pipes are very handy and can help you keep your code looking nice and tidy! The keyboard shortcut for adding a pipe is: cmd + shift + m on a mac and ctrl + shift + m on a PC.

Tip: Renaming data frame columns in dplyr

In Chapter 4 we covered how you can rename columns with base R by assigning a value to the output of the names() function. As with selecting columns in base R, this is a bit cumbersome, but thankfully dplyr has a rename() function.

Within a pipeline, the syntax is rename(new_name = old_name). For example, we may want to rename the gdpPercap column name from our select() statement above.

tidy_gdp <- year_country_gdp %>% rename(gdp_pc = gdpPercap)

head(tidy_gdp)
##   year     country   gdp_pc
## 1 1952 Afghanistan 779.4453
## 2 1957 Afghanistan 820.8530
## 3 1962 Afghanistan 853.1007
## 4 1967 Afghanistan 836.1971
## 5 1972 Afghanistan 739.9811
## 6 1977 Afghanistan 786.1134

Quick challenge: Why doesn’t the rename command require a double equals sign?

Using filter()

If we now want to move forward with the above, but only with European countries, we can combine select() and filter().

year_country_gdp_euro <- gapminder %>% 
  filter(continent == "Europe") %>% #subset to only countries in Europe
  select(year, country, gdpPercap) #select these three columns only

Note how you can use separate lines for different functions, and they are still connected by the pipe symbol. This keeps the code neat, and facilitates helpful annotation placement. The pipe symbol must appear at the end of the line if you use this (very common) notation style.

If we now want to show life expectancy of European countries but only for a specific year (e.g., 2007), we can do as below.

europe_lifeExp_2007 <- gapminder %>%
  filter(continent == "Europe", year == 2007) %>%
  select(country, lifeExp)

Here we used a comma to separate multiple conditions we’d like to be true within our filter() function. The comma (,) operator works exactly the same as the ampersand (&) operator in these functions, indicating that both conditions must be true for a row to be included in the resulting, filtered data frame. There is also an “or” operator, the | symbol, which is accessed via shift + backslash.
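To make the difference between “and” and “or” concrete, here is a small sketch on a made-up data frame (`toy` and its values are hypothetical, chosen only to keep the row counts easy to check by eye).

```r
library(dplyr)

# Hypothetical toy data frame
toy <- data.frame(
  country   = c("A", "B", "C", "D"),
  continent = c("Europe", "Europe", "Asia", "Asia"),
  year      = c(2002, 2007, 2002, 2007)
)

# Comma and & both require every condition to hold
with_comma <- toy %>% filter(continent == "Europe", year == 2007)
with_and   <- toy %>% filter(continent == "Europe" & year == 2007)

# | keeps rows where at least one condition holds
with_or <- toy %>% filter(continent == "Europe" | year == 2007)

nrow(with_comma)
nrow(with_and)
nrow(with_or)
```

Only country B is both European and from 2007, while A, B and D each satisfy at least one of the two conditions, so the “or” version returns more rows.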

Challenge 1

Write a single command (which can span multiple lines and includes pipes) that will produce a data frame that has the African values for lifeExp, country and year, but not for other continents. How many rows does your data frame have and why? How else could you check that this worked properly?

Solution to Challenge 1:

year_country_lifeExp_Africa <- gapminder %>%
                           filter(continent == "Africa") %>%
                           select(year, country, lifeExp)

#check
levels(as.factor(year_country_lifeExp_Africa$country)) #this looks correct
##  [1] "Algeria"                  "Angola"                  
##  [3] "Benin"                    "Botswana"                
##  [5] "Burkina Faso"             "Burundi"                 
##  [7] "Cameroon"                 "Central African Republic"
##  [9] "Chad"                     "Comoros"                 
## [11] "Congo Dem. Rep."          "Congo Rep."              
## [13] "Cote d'Ivoire"            "Djibouti"                
## [15] "Egypt"                    "Equatorial Guinea"       
## [17] "Eritrea"                  "Ethiopia"                
## [19] "Gabon"                    "Gambia"                  
## [21] "Ghana"                    "Guinea"                  
## [23] "Guinea-Bissau"            "Kenya"                   
## [25] "Lesotho"                  "Liberia"                 
## [27] "Libya"                    "Madagascar"              
## [29] "Malawi"                   "Mali"                    
## [31] "Mauritania"               "Mauritius"               
## [33] "Morocco"                  "Mozambique"              
## [35] "Namibia"                  "Niger"                   
## [37] "Nigeria"                  "Reunion"                 
## [39] "Rwanda"                   "Sao Tome and Principe"   
## [41] "Senegal"                  "Sierra Leone"            
## [43] "Somalia"                  "South Africa"            
## [45] "Sudan"                    "Swaziland"               
## [47] "Tanzania"                 "Togo"                    
## [49] "Tunisia"                  "Uganda"                  
## [51] "Zambia"                   "Zimbabwe"

Challenge 2

Now, repeat the exercise above, but include values for lifeExp, country, and year for both Africa and Oceania, but not the other continents. How many rows does this data frame have and why?

Solution to challenge 2:

year_country_lifeExp_Afr_Oce <- gapminder %>% 
  filter(continent == "Africa" | continent == "Oceania") %>% 
  select(year, country, lifeExp)

#check
levels(as.factor(year_country_lifeExp_Afr_Oce$country)) #this looks correct
##  [1] "Algeria"                  "Angola"                  
##  [3] "Australia"                "Benin"                   
##  [5] "Botswana"                 "Burkina Faso"            
##  [7] "Burundi"                  "Cameroon"                
##  [9] "Central African Republic" "Chad"                    
## [11] "Comoros"                  "Congo Dem. Rep."         
## [13] "Congo Rep."               "Cote d'Ivoire"           
## [15] "Djibouti"                 "Egypt"                   
## [17] "Equatorial Guinea"        "Eritrea"                 
## [19] "Ethiopia"                 "Gabon"                   
## [21] "Gambia"                   "Ghana"                   
## [23] "Guinea"                   "Guinea-Bissau"           
## [25] "Kenya"                    "Lesotho"                 
## [27] "Liberia"                  "Libya"                   
## [29] "Madagascar"               "Malawi"                  
## [31] "Mali"                     "Mauritania"              
## [33] "Mauritius"                "Morocco"                 
## [35] "Mozambique"               "Namibia"                 
## [37] "New Zealand"              "Niger"                   
## [39] "Nigeria"                  "Reunion"                 
## [41] "Rwanda"                   "Sao Tome and Principe"   
## [43] "Senegal"                  "Sierra Leone"            
## [45] "Somalia"                  "South Africa"            
## [47] "Sudan"                    "Swaziland"               
## [49] "Tanzania"                 "Togo"                    
## [51] "Tunisia"                  "Uganda"                  
## [53] "Zambia"                   "Zimbabwe"

As with last time, first we pass the gapminder data frame to the filter() function, then we pass the filtered version of the gapminder data frame to the select() function. Note: The order of operations is very important in this case. If we used ‘select’ first, filter would not be able to find the variable continent since we would have removed it in the previous step.
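The sketch below illustrates why the order matters, using a made-up data frame (`toy` is hypothetical). The tryCatch() wrapper and its message are only there so the example keeps running after the error; they are not part of the lesson’s workflow. It also assumes there is no separate object called continent in your workspace.

```r
library(dplyr)

# Hypothetical toy data frame
toy <- data.frame(
  country   = c("A", "B"),
  continent = c("Africa", "Europe"),
  lifeExp   = c(50, 70)
)

# filter() first: the continent column is still available
ok <- toy %>%
  filter(continent == "Africa") %>%
  select(country, lifeExp)

# select() first: continent has already been dropped, so filter() raises an error
res <- tryCatch(
  toy %>%
    select(country, lifeExp) %>%
    filter(continent == "Africa"),
  error = function(e) "error: continent not found"
)

ok
res
```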

Using group_by()

Now, we were supposed to be reducing the error-prone repetitiveness of what can be done with base R, but up to now we haven’t done that since we would have to repeat the above for each continent. Instead of filter(), which will only pass observations that meet your criteria (in the above: continent=="Europe"), we can use group_by(), which will essentially use every unique criterion (i.e. level) that you could have used in filter().

str(gapminder)
## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
str(gapminder %>% group_by(continent))
## gropd_df [1,704 × 6] (S3: grouped_df/tbl_df/tbl/data.frame)
##  $ country  : chr [1:1704] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num [1:1704] 8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: chr [1:1704] "Asia" "Asia" "Asia" "Asia" ...
##  $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
##  - attr(*, "groups")= tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
##   ..$ continent: chr [1:5] "Africa" "Americas" "Asia" "Europe" ...
##   ..$ .rows    : list<int> [1:5] 
##   .. ..$ : int [1:624] 25 26 27 28 29 30 31 32 33 34 ...
##   .. ..$ : int [1:300] 49 50 51 52 53 54 55 56 57 58 ...
##   .. ..$ : int [1:396] 1 2 3 4 5 6 7 8 9 10 ...
##   .. ..$ : int [1:360] 13 14 15 16 17 18 19 20 21 22 ...
##   .. ..$ : int [1:24] 61 62 63 64 65 66 67 68 69 70 ...
##   .. ..@ ptype: int(0) 
##   ..- attr(*, ".drop")= logi TRUE

You will notice that the structure of the data frame where we used group_by() (grouped_df) is not the same as the original gapminder (data.frame). A grouped_df can be thought of as a list where each item in the list is a data.frame that contains only the rows corresponding to a particular value of continent (at least in the example above).

Diagram illustrating how the group by function organizes a data frame into groups
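If the str() output is hard to read, dplyr has a few helper functions for inspecting a grouped data frame. The sketch below applies them to a made-up data frame (`toy` is hypothetical) to show that grouping records which rows belong together without changing the data themselves.

```r
library(dplyr)

# Hypothetical toy data frame
toy <- data.frame(
  continent = c("Asia", "Asia", "Europe"),
  lifeExp   = c(60, 62, 75)
)

grouped <- toy %>% group_by(continent)

n_groups(grouped)    # how many groups there are
group_keys(grouped)  # one row per group
group_size(grouped)  # number of rows in each group
nrow(grouped)        # total rows, to compare with nrow(toy)
```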

Using summarize()

The above was a bit on the uneventful side but group_by() is much more exciting in conjunction with summarize(). This will allow us to create new variable(s) by using functions that repeat for each of the continent-specific data frames. That is to say, using the group_by() function, we split our original data frame into multiple pieces, then we can run functions (e.g. mean() or sd()) within summarize().

Let’s calculate the mean GDP per capita for each continent, like we did at the top of this lesson, but using group_by() and summarize().

gdp_bycontinents <- gapminder %>%
    group_by(continent) %>%
    summarize(mean_gdpPercap = mean(gdpPercap))

Diagram illustrating the use of group by and summarize together to create a new variable

continent mean_gdpPercap
     <fctr>          <dbl>
1    Africa       2193.755
2  Americas       7136.110
3      Asia       7902.150
4    Europe      14469.476
5   Oceania      18621.609

That allowed us to calculate the mean gdpPercap for each continent, but it gets even better.
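As a quick sanity check on a made-up data frame (`toy` is hypothetical), the grouped summary gives the same number as the base R subsetting used at the top of the lesson.

```r
library(dplyr)

# Hypothetical toy data frame
toy <- data.frame(
  continent = c("Asia", "Asia", "Europe", "Europe"),
  gdpPercap = c(1000, 3000, 10000, 20000)
)

# One row per continent, computed in a single pipeline
by_continent <- toy %>%
  group_by(continent) %>%
  summarize(mean_gdpPercap = mean(gdpPercap))

# The base R way, for one continent only
base_asia <- mean(toy$gdpPercap[toy$continent == "Asia"])

by_continent
base_asia
```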

Challenge 3

  1. Calculate the average life expectancy per country.
  2. Use filter() to determine which country has the longest average life expectancy and which has the shortest average life expectancy?

Solution to challenge 3:

lifeExp_bycountry <- gapminder %>%
   group_by(country) %>%
   summarize(mean_lifeExp = mean(lifeExp))

lifeExp_bycountry %>%
   filter(mean_lifeExp == min(mean_lifeExp) | mean_lifeExp == max(mean_lifeExp))
## # A tibble: 2 × 2
##   country      mean_lifeExp
##   <chr>               <dbl>
## 1 Iceland              76.5
## 2 Sierra Leone         36.8

Another way to do this is to use the dplyr function arrange(), which arranges the rows in a data frame according to the order of one or more variables from the data frame. It has similar syntax to other functions from the dplyr package. You can use desc() inside arrange() to sort in descending order.

lifeExp_bycountry %>%
   arrange(mean_lifeExp) %>% #default arrangement is ascending order
   head(1) #just give the first row
## # A tibble: 1 × 2
##   country      mean_lifeExp
##   <chr>               <dbl>
## 1 Sierra Leone         36.8
lifeExp_bycountry %>%
   arrange(desc(mean_lifeExp)) %>%
   head(1) 
## # A tibble: 1 × 2
##   country mean_lifeExp
##   <chr>          <dbl>
## 1 Iceland         76.5

Alphabetical order works too:

lifeExp_bycountry %>%
   arrange(desc(country)) %>%
   head()
## # A tibble: 6 × 2
##   country            mean_lifeExp
##   <chr>                     <dbl>
## 1 Zimbabwe                   52.7
## 2 Zambia                     46.0
## 3 Yemen Rep.                 46.8
## 4 West Bank and Gaza         60.3
## 5 Vietnam                    57.5
## 6 Venezuela                  66.6

The function group_by() allows us to group by multiple variables. Let’s group by year and continent.

gdp_bycontinents_byyear <- gapminder %>%
    group_by(continent, year) %>% 
    summarize(mean_gdpPercap = mean(gdpPercap))
## `summarise()` has grouped output by 'continent'. You can override using the
## `.groups` argument.

That is already quite powerful, but it gets even better! You’re not limited to defining 1 new variable in summarize().

gdp_pop_bycontinents_byyear <- gapminder %>%
    group_by(continent, year) %>%
    summarize(mean_gdpPercap = mean(gdpPercap),
              sd_gdpPercap = sd(gdpPercap),
              mean_pop = mean(pop),
              sd_pop = sd(pop))
## `summarise()` has grouped output by 'continent'. You can override using the
## `.groups` argument.
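The message above is informational rather than an error: after summarizing over two grouping variables, the result is still grouped by the first one. The sketch below, on a made-up data frame (`toy` is hypothetical), shows one way to handle it by passing `.groups = "drop"`. Whether you want the result to stay grouped depends on what you do next.

```r
library(dplyr)

# Hypothetical toy data frame
toy <- data.frame(
  continent = c("Asia", "Asia", "Europe", "Europe"),
  year      = c(2002, 2007, 2002, 2007),
  gdpPercap = c(1000, 3000, 10000, 20000)
)

# Default behaviour: the last grouping variable (year) is dropped, continent is kept
still_grouped <- toy %>%
  group_by(continent, year) %>%
  summarize(mean_gdpPercap = mean(gdpPercap))

# .groups = "drop" returns an ungrouped result and silences the message
ungrouped <- toy %>%
  group_by(continent, year) %>%
  summarize(mean_gdpPercap = mean(gdpPercap), .groups = "drop")

group_vars(still_grouped)
group_vars(ungrouped)
```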

count() and n()

A very common operation is to count the number of observations for each group. The dplyr package comes with two related functions that help with this.

For instance, if we wanted to check the number of countries included in the dataset for the year 2002, we can use the count() function. It takes the name of one or more columns that contain the groups we are interested in, and we can optionally sort the results in descending order by adding sort=TRUE:

gapminder %>%
    filter(year == 2002) %>%
    count(continent, sort = TRUE)
##   continent  n
## 1    Africa 52
## 2      Asia 33
## 3    Europe 30
## 4  Americas 25
## 5   Oceania  2

If we need to use the number of observations in calculations, the n() function is useful. It will return the total number of observations in the current group rather than counting the number of observations in each group within a specific column. For instance, if we wanted to get the standard error of the life expectancy per continent:

gapminder %>%
    group_by(continent) %>%
    summarize(se_le = sd(lifeExp)/sqrt(n()))
## # A tibble: 5 × 2
##   continent se_le
##   <chr>     <dbl>
## 1 Africa    0.366
## 2 Americas  0.540
## 3 Asia      0.596
## 4 Europe    0.286
## 5 Oceania   0.775
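To see how the two functions relate, the sketch below counts rows per group both ways on a made-up data frame (`toy` is hypothetical). The column name n in the second pipeline is chosen to match what count() produces by default.

```r
library(dplyr)

# Hypothetical toy data frame
toy <- data.frame(
  continent = c("Asia", "Asia", "Asia", "Europe"),
  lifeExp   = c(60, 62, 64, 75)
)

# count() groups and counts in one step
with_count <- toy %>% count(continent)

# n() does the counting inside summarize(), after an explicit group_by()
with_n <- toy %>%
  group_by(continent) %>%
  summarize(n = n())

with_count
with_n
```

The advantage of n() is that it can be combined with other calculations in the same summarize() call, as in the standard error example.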

You can also chain together several summary operations; in this case calculating the minimum, maximum, mean and se of each continent’s per-country life-expectancy:

gapminder %>%
    group_by(continent) %>%
    summarize(
      mean_le = mean(lifeExp),
      min_le = min(lifeExp),
      max_le = max(lifeExp),
      se_le = sd(lifeExp)/sqrt(n()))
## # A tibble: 5 × 5
##   continent mean_le min_le max_le se_le
##   <chr>       <dbl>  <dbl>  <dbl> <dbl>
## 1 Africa       48.9   23.6   76.4 0.366
## 2 Americas     64.7   37.6   80.7 0.540
## 3 Asia         60.1   28.8   82.6 0.596
## 4 Europe       71.9   43.6   81.8 0.286
## 5 Oceania      74.3   69.1   81.2 0.775

Using mutate()

We can also create new variables prior to (or even after) summarizing information using mutate().

gdp_pop_bycontinents_byyear <- gapminder %>%
    mutate(gdp_billion = gdpPercap*pop/10^9) %>%
    group_by(continent,year) %>%
    summarize(mean_gdpPercap = mean(gdpPercap),
              sd_gdpPercap = sd(gdpPercap),
              mean_pop = mean(pop),
              sd_pop = sd(pop),
              mean_gdp_billion = mean(gdp_billion),
              sd_gdp_billion = sd(gdp_billion))
## `summarise()` has grouped output by 'continent'. You can override using the
## `.groups` argument.

Connect mutate with logical filtering: ifelse

When creating new variables, we can make their values depend on a logical condition. A simple combination of mutate() and ifelse() applies the filtering right where it is needed: at the moment of creating something new. This easy-to-read statement is a fast and powerful way of discarding certain data (even though the overall dimensions of the data frame will not change) or of updating values depending on the given condition.

Let’s calculate the total GDP in billions, only for observations (country and year combinations) with a life expectancy greater than 25, and store that value in a new column.

gdp_pop_bycontinents_byyear_above25 <- gapminder %>%
    mutate(gdp_billion = ifelse(lifeExp > 25, gdpPercap * pop / 10^9, NA)) 

Let’s break that down. mutate builds a new column, with name gdp_billion. Inside our ifelse statement, we first list the condition for a given observation – here our condition is “life expectancy of this country in this year is over 25 years”. Then, we tell R what to put in the gdp_billion column if that condition is met, here the calculation, then what to put in the gdp_billion column if that condition is not met, here NA.
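Here is the same pattern on a two-row made-up data frame (`toy` and its values are hypothetical), where one row meets the condition and one does not.

```r
library(dplyr)

# Hypothetical toy data frame: the first row fails the condition, the second passes
toy <- data.frame(
  lifeExp   = c(24, 60),
  gdpPercap = c(500, 2000),
  pop       = c(2e6, 5e6)
)

flagged <- toy %>%
  mutate(gdp_billion = ifelse(lifeExp > 25, gdpPercap * pop / 10^9, NA))

flagged$gdp_billion
nrow(flagged)  # no rows are removed; only the new column's values differ
```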

Now, let’s summarize that information, along with gdpPercap and pop, by continent and year.

gdp_pop_bycontinents_byyear_above25 <- gapminder %>% 
    mutate(gdp_billion = ifelse(lifeExp > 25, gdpPercap * pop / 10^9, NA)) %>% 
    group_by(continent, year) %>%
    summarize(mean_gdpPercap = mean(gdpPercap),
              sd_gdpPercap = sd(gdpPercap),
              mean_pop = mean(pop),
              sd_pop = sd(pop),
              mean_gdp_billion = mean(gdp_billion),
              sd_gdp_billion = sd(gdp_billion))
## `summarise()` has grouped output by 'continent'. You can override using the
## `.groups` argument.
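One thing to watch for with this approach: mean() and sd() return NA when any of their inputs is NA, so a group containing an excluded observation will get NA for the gdp_billion summaries. The sketch below, on a made-up data frame (`toy` is hypothetical), shows the effect of na.rm = TRUE. Whether dropping the missing values is appropriate is an analysis decision, not something this example settles.

```r
library(dplyr)

# Hypothetical toy data frame with one missing value in the Africa group
toy <- data.frame(
  continent   = c("Africa", "Africa", "Asia"),
  gdp_billion = c(NA, 4, 10)
)

out <- toy %>%
  group_by(continent) %>%
  summarize(
    mean_default = mean(gdp_billion),               # NA if any value in the group is missing
    mean_na_rm   = mean(gdp_billion, na.rm = TRUE)  # missing values are ignored
  )

out
```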

Combining dplyr and ggplot2

In the plotting lesson we looked at how to make a multi-panel figure by adding a layer of facet panels using ggplot2. Here is the code we used (with some extra comments):

# Filter countries located in the Americas
americas <- gapminder[gapminder$continent == "Americas", ]

# Make the plot
ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) +
  geom_line() +
  facet_wrap( ~ country) +
  theme(axis.text.x = element_text(angle = 45))

This code makes the right plot but it also creates an intermediate variable (americas) that we might not have any other uses for. Just as we used %>% to pipe data along a chain of dplyr functions we can use it to pass data to ggplot(). Because %>% replaces the first argument in a function we don’t need to specify the data = argument in the ggplot() function. By combining dplyr and ggplot2 functions we can make the same figure without creating any new variables or modifying the data.

gapminder %>% 
  filter(continent == "Americas") %>% # Filter countries located in the Americas
  ggplot(mapping = aes(x = year, y = lifeExp)) + # Make the plot
  geom_line() +
  facet_wrap( ~ country) +
  theme(axis.text.x = element_text(angle = 45))

More examples of using the function mutate() and the ggplot2 package.

gapminder %>%
  mutate(startsWith = substr(country, 1, 1)) %>%  # extract first letter of country name into new column
  filter(startsWith %in% c("A", "Z")) %>% # only keep countries starting with A or Z
  ggplot(aes(x = year, y = lifeExp, colour = continent)) + # plot lifeExp into facets
  geom_line() +
  facet_wrap(~country) +
  theme_minimal()

Challenge 4

Make a graph that shows how population size has changed through time in each country that begins with the same letter as your given name (unless your first name begins with “X” or “Q”, in which case, try your surname!). Colour the data by whether or not you were born yet (hint: create a new variable that does this using the mutate and ifelse functions). Use facets to display each country separately.

Bonus: Look up labs, shape, linetype, theme, or facet_grid in the ggplot2 documentation and add one more degree of customization to your graph!

Solution to challenge 4:

gapminder %>% 
  mutate(startsWith = substr(country, 1, 1)) %>% 
  filter(startsWith == "K" | startsWith == "D") %>% #No Q or X in dataset
  mutate(year_before_me = ifelse(year >= 1994, "NO", "YES")) %>% 
  ggplot(aes(x = year, y = pop, colour = year_before_me, shape = continent)) + 
  geom_point() + 
  facet_grid(~country)

Advanced Challenge

Calculate the average life expectancy in 2002 of 2 randomly selected countries for each continent. Then arrange the continent names in reverse order. Hint: Use the dplyr functions arrange() and sample_n(), they have similar syntax to other dplyr functions.

Solution to Advanced Challenge:

lifeExp_2countries_bycontinents <- gapminder %>%
   filter(year == 2002) %>%
   group_by(continent) %>%
   sample_n(2) %>%
   summarize(mean_lifeExp = mean(lifeExp)) %>%
   arrange(desc(continent)) #continent names in reverse alphabetical order
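Because sample_n() draws rows at random, the result will differ from run to run. If you need the same draw each time, one option is to call set.seed() first. The sketch below uses a made-up data frame (`toy`) and a hypothetical helper function draw(); the seed value 42 is arbitrary.

```r
library(dplyr)

# Hypothetical toy data frame with four rows per continent
toy <- data.frame(
  continent = rep(c("Asia", "Europe"), each = 4),
  lifeExp   = c(60, 62, 64, 66, 75, 76, 77, 78)
)

# Hypothetical helper: resets the seed, then samples two rows per continent
draw <- function() {
  set.seed(42)  # arbitrary seed, chosen only for illustration
  toy %>%
    group_by(continent) %>%
    sample_n(2) %>%
    summarize(mean_lifeExp = mean(lifeExp)) %>%
    arrange(desc(continent))
}

first  <- draw()
second <- draw()

# With the same seed, both draws select the same rows
identical(first, second)
```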

Key Points

  • Use the dplyr package to manipulate data frames.
  • Use select() to choose variables from a data frame.
  • Use filter() to choose data based on values.
  • Use group_by() and summarize() to work with subsets of data.
  • Use mutate() to create new variables.